Welcome to Data Visualization in R! We’ll be using a dataset from the Pew Research Center’s Forum on Religion & Public Life, a 2014 nationwide survey of Americans’ religious beliefs. You can download the data from Pew or my GitHub page (sorry Pew!).
Importing data looks like this:
library(haven)
pew <- read_sav("Pew-Research-Center-2014-U.S.-Religious-Landscape-Study/Dataset - Pew Research Center 2014 Religious Landscape Study National Telephone Survey - Version 1.1 - December 1 2016.sav")
To visualize this data we will be using a package called ggplot2, which is included in a package of packages called tidyverse:
library(tidyverse)
While we’re all getting set up, here’s a roadmap for the afternoon:
The Pew data comes with a codebook from which we’ll pick out a few variables to work with. If you want to see some information about all of the variables at once, you could do some of these things:
summary(pew)
pairs(pew)
cor(pew)
… but you don’t want to do any of those things in a dataset with this many variables. You can check out the names and basic structure of a dataset with str().
str(pew)
# A quick note here: eval=FALSE stops the code chunk from being run, so it adds no output to your markdown document.
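If you only care about a handful of variables, one option is to pick them out first and summarise just those. Here’s a minimal sketch on a made-up toy tibble (toy and reltrad_counts are hypothetical names, and the values are invented; only the column names come from the Pew data):

```r
library(dplyr)

# Toy stand-in for pew with the three columns used in this post
toy <- tibble(
  RELTRAD = c(1100, 1200, 10000, 1100, 100000),
  agerec  = c(1, 2, 3, 2, 99),
  ideo    = c(1, 3, 5, 3, 9)
)

# Summarise only the columns of interest instead of the whole dataset
toy %>%
  select(RELTRAD, agerec, ideo) %>%
  summary()

# Count how often each code appears in a single variable
reltrad_counts <- toy %>% count(RELTRAD)
reltrad_counts
```

Swapping toy for pew should give the same kind of overview for the real survey, without printing every variable.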
Let’s start with the basics: a bar chart of religious affiliation. The simplest version of this is the variable “RELTRAD.”
ggplot(pew, aes(RELTRAD)) + geom_bar()
## Don't know how to automatically pick scale for object of type labelled. Defaulting to continuous.
That didn’t work! Why not? This is a pretty normal mistake that I regularly make: the data I want to use is stored as numeric codes rather than text (characters), and different geoms require different kinds of data. If we look at the codebook, we also see that the codes themselves aren’t helpful for our visualization. We need to use the codebook to transform the data into something useful, using the mutate() and case_when() functions from dplyr, another package in the tidyverse.
pew %>%
  mutate(
    RELTRAD2 = case_when(
      RELTRAD == 1100 ~ "Evangelical Protestant Tradition",
      RELTRAD == 1200 ~ "Mainline Protestant Tradition",
      RELTRAD == 1300 ~ "Historically Black Protestant Tradition",
      RELTRAD == 10000 ~ "Catholic",
      RELTRAD == 20000 ~ "Mormon",
      RELTRAD == 30000 ~ "Orthodox Christian",
      RELTRAD == 40001 ~ "Jehovah's Witness",
      RELTRAD == 40002 ~ "Other Christian",
      RELTRAD == 50000 ~ "Jewish",
      RELTRAD == 60000 ~ "Muslim",
      RELTRAD == 70000 ~ "Buddhist",
      RELTRAD == 80000 ~ "Hindu",
      RELTRAD == 90001 ~ "Other World Religions",
      RELTRAD == 90002 ~ "Other Faiths",
      RELTRAD == 100000 ~ "Unaffiliated (religious 'nones')",
      RELTRAD == 900000 ~ "Don't know/refused - no information on religious identity")) -> pew
ggplot(pew, aes(RELTRAD2)) + geom_bar()
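If typing out the whole codebook feels tedious, one possible shortcut is haven’s as_factor(), which turns a labelled column into a factor using the value labels saved in the .sav file. Here’s a minimal sketch on a made-up labelled vector (the toy values and the names reltrad_toy and reltrad_factor are my own, and I’m assuming the real RELTRAD column carries labels like these):

```r
library(haven)

# Toy labelled vector standing in for pew$RELTRAD (values invented for illustration)
reltrad_toy <- labelled(
  c(1100, 10000, 1100, 1200),
  labels = c(
    "Evangelical Protestant Tradition" = 1100,
    "Mainline Protestant Tradition" = 1200,
    "Catholic" = 10000
  )
)

class(reltrad_toy)   # a labelled class, not a factor or character

# Replace each numeric code with its stored label
reltrad_factor <- as_factor(reltrad_toy)
reltrad_factor
```

On the real data that would look something like mutate(RELTRAD2 = as_factor(RELTRAD)), though the labels would then be whatever Pew stored in the file rather than the ones typed out above. Either route gives you the same kind of bar chart.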
That’s still pretty terrible looking. We want to do something like this:
Let’s clean our plot up with code options from ggplot. Run the code below in stages: after geom_bar(), highlight one additional line every time you run the code to watch what happens.
ggplot(pew, aes(x = fct_infreq(RELTRAD2))) +
  geom_bar() +
  coord_flip() +
  theme_minimal() +
  labs(y = "Number of Respondents", x = "Religious Tradition") +
  ggtitle("Prevalence of Religions in the United States", subtitle = "Source: Pew Religion Survey, 2014") +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
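The piece of that code doing the sorting is fct_infreq() from forcats, which reorders the categories from most to least common. A tiny sketch on a made-up vector (tradition and by_count are hypothetical names) shows what it does to the level order:

```r
library(forcats)

# Toy vector: "Catholic" appears three times, "Jewish" twice, "Hindu" once
tradition <- factor(c("Hindu", "Jewish", "Catholic", "Jewish", "Catholic", "Catholic"))

levels(tradition)          # alphabetical by default

# Reorder the levels by how often each value appears, most common first
by_count <- fct_infreq(tradition)
levels(by_count)

# fct_rev() flips the order, least common first
levels(fct_rev(by_count))
```

Because coord_flip() draws the first level at the bottom of the axis, wrapping the variable in fct_rev(fct_infreq(...)) is one way to put the largest bar at the top of the plot instead.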
Better! Let’s try another one. We know now that we’ll need to do some mutating in order to take a look at the relationship between age and ideology in the survey:
pew %>%
  mutate(
    agerec2 = case_when(
      agerec == 1 ~ "Age 24 or younger",
      agerec == 2 ~ "Age 25-29",
      agerec == 3 ~ "30-34",
      agerec == 4 ~ "35-39",
      agerec == 5 ~ "40-44",
      agerec == 6 ~ "45-49",
      agerec == 7 ~ "50-54",
      agerec == 8 ~ "55-59",
      agerec == 9 ~ "60-64",
      agerec == 10 ~ "65-69",
      agerec == 11 ~ "70-74",
      agerec == 12 ~ "75-79",
      agerec == 13 ~ "80-84",
      agerec == 14 ~ "85-89",
      agerec == 15 ~ "Age 90 or older",
      agerec == 99 ~ "NA")
  ) -> pew
ggplot(pew, aes(agerec2)) + geom_bar()
Not cool. Let’s clean things up a bit and plot again:
pew %>%
  mutate(
    agerec2 = case_when(
      agerec == 1 ~ "24 or younger",
      agerec == 2 ~ "25-29",
      agerec == 3 ~ "30-34",
      agerec == 4 ~ "35-39",
      agerec == 5 ~ "40-44",
      agerec == 6 ~ "45-49",
      agerec == 7 ~ "50-54",
      agerec == 8 ~ "55-59",
      agerec == 9 ~ "60-64",
      agerec == 10 ~ "65-69",
      agerec == 11 ~ "70-74",
      agerec == 12 ~ "75-79",
      agerec == 13 ~ "80-84",
      agerec == 14 ~ "85-89",
      agerec == 15 ~ "90 or older",
      agerec == 99 ~ "NA"),
    ideo2 = case_when(
      ideo == 1 ~ "1 - Very conservative",
      ideo == 2 ~ "2 - Conservative",
      ideo == 3 ~ "3 - Moderate",
      ideo == 4 ~ "4 - Liberal",
      ideo == 5 ~ "5 - Very liberal",
      ideo == 9 ~ "NA")) -> pew
ggplot(subset(pew, agerec2 != "NA" & ideo2 != "NA"), aes(agerec2, fill = ideo2)) +
  geom_bar() +
  coord_flip() +
  theme_minimal() +
  labs(y = "Number of Respondents", x = "Age Group", fill = "Ideology") +
  ggtitle("Political Identification by Age", subtitle = "Source: Pew Religion Survey, 2014") +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
  scale_fill_manual(
    breaks = c("1 - Very conservative", "2 - Conservative", "3 - Moderate", "4 - Liberal", "5 - Very liberal"),
    values = c("red", "dark red", "purple4", "dark blue", "blue"))
Now we’re getting somewhere!
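One thing the code above does is store missing answers as the text "NA" and then filter on that text. A possible alternative, sketched here on a made-up toy tibble (toy and toy_clean are hypothetical names, and only a few of the codes are included to keep it short), is to use a real missing value so that is.na() can do the filtering:

```r
library(dplyr)

# Toy stand-in for the survey: 99 and 9 are the "don't know/refused" codes
toy <- tibble(
  agerec = c(1, 2, 99, 3),
  ideo   = c(1, 9, 3, 5)
)

toy_clean <- toy %>%
  mutate(
    agerec2 = case_when(
      agerec == 1 ~ "24 or younger",
      agerec == 2 ~ "25-29",
      agerec == 3 ~ "30-34",
      agerec == 99 ~ NA_character_   # a real missing value, not the string "NA"
    ),
    ideo2 = case_when(
      ideo == 1 ~ "1 - Very conservative",
      ideo == 3 ~ "3 - Moderate",
      ideo == 5 ~ "5 - Very liberal",
      ideo == 9 ~ NA_character_
    )
  ) %>%
  # Keep only respondents with both an age group and an ideology
  filter(!is.na(agerec2), !is.na(ideo2))

toy_clean
```

Any code that case_when() doesn’t match also comes back as a real NA, so this version would drop those rows too.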
Now, just for kicks: When I’m well-caffeinated, I always take a first stab at one thing that I’d like to learn how to do but don’t understand yet. This time I wanted to try gganimate:
library(gganimate)
p <- ggplot(subset(pew, agerec2 != "NA" & ideo2 != "NA"), aes(ideo2, fill = ideo2)) +
  geom_bar() +
  coord_flip() +
  theme_minimal() +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
  labs(y = "Number of Respondents", x = "") +
  ggtitle("Political Identification by Age", subtitle = "Source: Pew Religion Survey, 2014") +
  scale_fill_manual(
    breaks = c("1 - Very conservative", "2 - Conservative", "3 - Moderate", "4 - Liberal", "5 - Very liberal"),
    values = c("red", "dark red", "purple4", "dark blue", "blue"))
p + transition_states(agerec2) +
  labs(title = "Americans' Political Ideology by Age: {closest_state}")
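For anyone else trying gganimate for the first time, here’s a small sketch of the same idea on made-up data. transition_length and state_length are arguments of transition_states() that set how long the animation spends moving between states versus pausing on each one; the toy data frame, the values 1 and 3, and the output file name are just my assumptions for illustration:

```r
library(ggplot2)
library(gganimate)

# Toy stand-in for the recoded survey columns (values invented for illustration)
toy <- data.frame(
  agerec2 = c("25-29", "25-29", "30-34", "30-34", "30-34"),
  ideo2   = c("2 - Conservative", "4 - Liberal",
              "2 - Conservative", "4 - Liberal", "4 - Liberal")
)

# One bar chart per age group; pause on each state three times as long as the transition
anim <- ggplot(toy, aes(ideo2, fill = ideo2)) +
  geom_bar() +
  coord_flip() +
  transition_states(agerec2, transition_length = 1, state_length = 3) +
  labs(title = "Ideology by Age: {closest_state}")

# Rendering is left commented out because it needs a renderer such as the gifski package:
# animate(anim, nframes = 40, fps = 10)
# anim_save("ideology-by-age.gif")   # hypothetical file name
```

Printing anim in an interactive session should render it with the default settings; animate() is there for when you want to control the frame count or speed yourself.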
More on ggplot and tidyverse:
Just some of the things I had to search for:
As usual, a massive thanks to Matt Worthington for all his help.